A reusable GitHub Action that tests agent skills in CI — run eval test cases against an LLM, grade responses against expectations, and catch regressions before merge.
What it does
- Discovers skills in the repo (scans for SKILL.md files)
- Loads test cases from tests/*.yml alongside each skill
- Runs each test — sends the prompt to an LLM with the skill loaded as system context
- Grades responses — binary pass/fail per expectation using LLM-as-judge (with retry on parse failure)
- Reports results as a GitHub Actions job summary with pass rates, regressions, and cost estimates
- Stores results in a Sanity dataset (optional) for baseline tracking and trend analysis
- Fails the build if the pass rate drops below the threshold or regressions are detected
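The per-test flow above can be sketched roughly as follows. This is a hypothetical illustration, not the Action's actual internals: `runTestCase`, `callModel`, and `judge` are made-up names, and `callModel` stands in for a Vercel AI SDK call.

```typescript
// Hypothetical sketch of the per-test eval flow (names are illustrative).
type TestCase = { prompt: string; expectations: string[]; tags?: string[] };
type LLM = (system: string, prompt: string) => Promise<string>;
type Judge = (response: string, expectation: string) => Promise<boolean>;

async function runTestCase(
  skillMarkdown: string, // contents of SKILL.md, loaded as system context
  test: TestCase,
  callModel: LLM, // model under test
  judge: Judge,   // LLM-as-judge grader
): Promise<{ passed: number; total: number }> {
  // 1. Send the prompt to the model with the skill as system context
  const response = await callModel(skillMarkdown, test.prompt);

  // 2. Grade each expectation independently, binary pass/fail
  let passed = 0;
  for (const expectation of test.expectations) {
    if (await judge(response, expectation)) passed++;
  }
  return { passed, total: test.expectations.length };
}
```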
```yaml
# tests/schema-advice.yml
prompt: "I need to create a blog post schema with title, author, and body content"
expectations:
  - "Uses defineType and defineField from sanity"
  - "Includes a slug field with source set to title"
  - "Uses reference type for author relationship"
  - "Uses array of block type for body/content field"
  - "Includes validation rules on required fields"
tags:
  - schema
  - core
```
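Because each expectation grades to a binary pass/fail, the build gate reduces to a simple ratio check against the pass-threshold input (0.8 by default). A minimal sketch, with hypothetical function names:

```typescript
// Hypothetical sketch: aggregate binary expectation results into a pass
// rate and compare it against the Action's pass-threshold input.
function passRate(results: boolean[]): number {
  if (results.length === 0) return 1; // no expectations: trivially passing
  return results.filter(Boolean).length / results.length;
}

function shouldFailBuild(results: boolean[], threshold = 0.8): boolean {
  // Fail only when strictly below the threshold
  return passRate(results) < threshold;
}
```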
Key decisions
| Decision | Choice | Rationale |
| --- | --- | --- |
| LLM provider | Vercel AI SDK (ai package) | Multi-provider support — test skills across Claude, GPT-4o, Gemini with one Action |
| Test format | YAML files in repo | Human-readable, reviewable in PRs, no build step, aligns with GH Actions conventions |
| Baseline storage | Sanity dataset (optional) | Queryable via GROQ, supports trend analysis, shared backend with Skills Studio |
| Grading | LLM-as-judge with retry | Binary pass/fail per expectation, retries once on JSON parse failure for CI reliability |
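The retry-on-parse-failure behavior in the grading row might look something like this sketch. The verdict shape (`{"pass": true|false}`) and the `askJudge` helper are assumptions for illustration:

```typescript
// Hypothetical sketch of LLM-as-judge grading with one retry on JSON
// parse failure. `askJudge` stands in for a grader-model call that is
// asked to return a JSON verdict like {"pass": true}.
async function gradeExpectation(
  askJudge: () => Promise<string>,
  maxAttempts = 2, // i.e. one retry on parse failure
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await askJudge();
    try {
      const parsed = JSON.parse(raw);
      if (typeof parsed.pass === "boolean") return parsed.pass;
    } catch {
      // Malformed JSON from the judge: fall through and retry
    }
  }
  // Treat a persistently unparseable verdict as a failure, not a crash
  return false;
}
```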
Open questions for the team

- Should we add a skill-eval.yml workflow to this repo? We could dogfood the Action on the existing skills, but it needs an API key secret added to the repo.
- Sanity dataset schema — The Action stores skillEvalResult documents. Should we deploy a schema for this, or let it be schemaless?
- changed-only scope — Currently detects changed skills via git diff. If a shared reference file changes, only the skill containing it is re-evaluated. Should we expand to eval all skills when any reference changes?
- Bundle size — The dist/index.js is ~3.1MB (Vercel AI SDK + provider adapters + Sanity client). This is within normal range for JS Actions but worth noting.
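For the changed-only question: the current behavior amounts to mapping paths from `git diff --name-only` back to the skill directory that contains them, so a change to a shared reference file only re-triggers the one skill it lives under. A sketch of that mapping (function name illustrative):

```typescript
// Hypothetical sketch of changed-only scoping: map changed file paths
// back to skill directories (directories containing a SKILL.md).
function changedSkills(changedPaths: string[], skillDirs: string[]): string[] {
  const hit = new Set<string>();
  for (const p of changedPaths) {
    for (const dir of skillDirs) {
      // A path belongs to a skill if it is the dir itself or lives inside it
      if (p === dir || p.startsWith(dir + "/")) hit.add(dir);
    }
  }
  return Array.from(hit).sort();
}
```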
Skill Eval GitHub Action
Usage
Basic (single provider)
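The workflow snippet for this section did not survive extraction. A minimal sketch, assuming the Action is referenced as `your-org/skill-eval-action@v1` (a placeholder path) and using the inputs from the table below:

```yaml
name: skill-eval
on: pull_request

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # full history so changed-only git diff works
      - uses: your-org/skill-eval-action@v1 # placeholder path
        with:
          provider: anthropic
          api-key: ${{ secrets.ANTHROPIC_API_KEY }}
          skills-path: ./skills
```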
Multi-model testing (matrix strategy)
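The matrix example was also lost in extraction. A sketch of what it might look like; the non-Anthropic model ids and secret names are illustrative, and the Action path is a placeholder:

```yaml
jobs:
  eval:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        include:
          - provider: anthropic
            model: claude-sonnet-4-20250514
            key: ANTHROPIC_API_KEY
          - provider: openai
            model: gpt-4o
            key: OPENAI_API_KEY
          - provider: google
            model: gemini-1.5-pro
            key: GOOGLE_API_KEY
    steps:
      - uses: actions/checkout@v4
      - uses: your-org/skill-eval-action@v1 # placeholder path
        with:
          provider: ${{ matrix.provider }}
          model: ${{ matrix.model }}
          api-key: ${{ secrets[matrix.key] }}
```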
With Sanity baseline tracking
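The Sanity example is likewise missing. A sketch of the relevant step, assuming the Sanity inputs listed in the table below (secret names are illustrative, Action path is a placeholder):

```yaml
      - uses: your-org/skill-eval-action@v1 # placeholder path
        with:
          provider: anthropic
          api-key: ${{ secrets.ANTHROPIC_API_KEY }}
          sanity-project-id: ${{ secrets.SANITY_PROJECT_ID }}
          sanity-dataset: skill-evals
          sanity-token: ${{ secrets.SANITY_TOKEN }}
```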
Test case format
Test cases live in tests/ alongside each skill as simple YAML, in the format shown in the schema-advice.yml example above.

Inputs

| Input | Default |
| --- | --- |
| provider | anthropic (one of anthropic, openai, google) |
| api-key | — |
| model | claude-sonnet-4-20250514 |
| grader-model | same as model |
| sanity-token | — |
| sanity-project-id | — |
| sanity-dataset | skill-evals |
| skills-path | ./skills |
| pass-threshold | 0.8 |
| fail-on-regression | true |
| max-evals-per-skill | 20 |
| changed-only | true |

Architecture
Sample test cases included
- skills/sanity-best-practices/tests/schema-advice.yml — Tests schema creation advice (5 expectations)
- skills/sanity-best-practices/tests/groq-query.yml — Tests GROQ query generation (5 expectations)

Relation to Skills Studio
This Action and the Skills Studio eval system are complementary: